Abstract Depth image based rendering techniques for 
multiview applications have been recently introduced 
for efficient view generation at arbitrary camera posi- 
tions. Encoding rate control has thus to consider both 
texture and depth data. Due to different structures of 
depth and texture images and their different roles on 
the rendered views, distributing the available bit bud- 
get between them however requires a careful analysis. 
Information loss due to texture coding affects the value 
of pixels in synthesized views while errors in depth in- 
formation lead to shift in objects or unexpected pat- 
terns at their boundaries. In this paper, we address the 
problem of efficient bit allocation between textures and 
depth data of multiview video sequences. We adopt a 
rate-distortion framework based on a simplified model 
of depth and texture images. Our model preserves the 
main features of depth and texture images. Unlike most
recent solutions, our method avoids rendering at
encoding time for distortion estimation, so that
the encoding complexity is not increased. In addition,
our model is independent of the underlying
inpainting method that is used at the decoder. Experiments
confirm our theoretical results and the efficiency of our
rate allocation strategy.

Keywords depth image based rendering • multiview 
video coding • rate allocation • rate-distortion analysis 



1 Introduction 

Three-dimensional video coding is a research field that 
has witnessed many technological revolutions in the re- 
cent years. One of them is the significant improvement 
in the capabilities of camera sensors. Nowadays, high
quality camera sensors that capture color and depth
information are easily accessible [1]. Obviously this brings
important modifications in the data that 3D transmission
systems have to process. A few years ago, transmission
systems used disparity to improve the compression
performance [2, 3]. Nowadays, 3D systems rather
employ depth information to improve the quality of
experience by, for example, increasing the number of views
that can be displayed at the receiver side [4, 5]. This is
possible because of depth image based rendering (DIBR)
techniques [6, 7] that project one reference image onto
virtual views using depth as geometrical information.
Figure 1 shows the overall structure of a DIBR multiview
coder that is also considered in this paper. It
includes the following steps: first, the captured views and
their corresponding depth maps are coded
at bit rates assigned by a rate allocation method. Then
the coded information is transmitted to the decoder.
Finally, at the decoder the reference views are decoded
and virtual views are synthesized using the depth
information. View synthesis consists of two parts: projection
into the virtual view location using the closest reference
views, and inpainting for filling the holes [8, 9] or pixels
that remain undetermined after projection.

DIBR techniques offer new possibilities but also impose
new challenges. One of the important questions
lies in the effect of depth compression on the view
synthesis performance [10]; in particular, for a given bit
budget R, what is the best allocation between depth
and texture data or, in other words, how can we distribute
the total bitrate between color and geometrical
information in order to maximize the rendering quality?
It is important to note that the quality of the rendered
view is of interest here, and not the distortion of depth
images [10, 11]. This renders the problem of rate
allocation particularly challenging.

The rate allocation problem has been the topic of
many studies in the past few years. Allocating a fixed
percentage of the total budget to the texture and depth
data is probably the simplest allocation policy in
DIBR coding methods [12, 13, 14]. More efficient methods
have however been proposed recently, and we discuss
them in more detail below.

Starting from the current multiview coding (MVC) 
profile of H.264/AVC [15, 16, 3], we should mention that 
MVC uses the distortion of depth maps to distribute the 
available bit budget between texture and depth images. 
A group of papers try to improve MVC by taking into 
account depth properties. In [17], the authors suggest a
preprocessing step based on an adaptive local median filter
to enhance spatial, temporal and inter-view correlations
between depth maps and consequently improve the
performance of MVC. Using the correlation between reference
views, the work in [18] skips some depth blocks in
the coding and hence reduces the bit budget required
for coding depth maps. Other methods try to estimate
at the encoder the distortion of virtual views, which then
replaces the depth map distortion in the mode decision
step of the MVC method [15]. In [19], the authors provide
an upper bound for virtual view distortion that
is related to the depth and texture errors and the gradients
of the original reference views. Another upper
bound for rendered view distortion is proposed in [20] for
the case where the encoder has access to the original
intermediate views. In [21], the algorithm calculates the
translation error induced by depth coding and then estimates
the rendered view distortion from the texture
data. In a similar approach, the work in [22] models the
distortion at each pixel of a virtual view, including the
pixels in occluded regions. These methods only try to
improve the current MVC profile and, without modeling
the distortion-rate behavior, they cannot be used
as general solutions for the rate allocation problem.

Besides improving the current MVC allocation policy,
other papers build a complete rate-distortion model
to solve the rate allocation problem of distributing the
total bit budget between texture and depth data in a
DIBR multiview coder [23, 24, 25, 26, 27]. For example,
assuming independence between depth and texture errors,
the work in [23] proposes a DR function to find
the optimal allocation in a video system with one reference
and one virtual view. A region-based approach for
estimating the distortion at virtual views is proposed
in [25]. Here, the allocation scheme is an iterative
algorithm that needs to render one virtual view at every
iteration for parameter initialization. This is very costly 
in terms of computational complexity. Along the same 
line of research, we also notice the rate allocation and 
view selection method proposed in [26]. In this work, 
the authors first provide a cubic distortion model for 
synthetic views; they estimate the model coefficients 
by rendering at least one intermediate view between 
each reference camera views. Then, using this distor- 
tion model, a DR function is formulated and a modi- 
fied search algorithm is executed to simplify rate alloca- 
tion. Finally, a DR function is provided for a layer-based 
depth coder in [27]. The main drawbacks in the above 
allocation schemes reside in the rendering of at least one 
virtual view at encoding time and in the construction 
of DR functions that are view dependent. Rendering at 
encoder side dramatically increases the computational 
complexity of the coder and is therefore not acceptable 
for realtime applications. In addition for view render- 
ing at arbitrary camera positions, multiview systems re- 
quire rate allocation strategies that work independently 
of reference and virtual view numbers and exact posi- 
tions. 

In this paper, we propose a novel DR model to solve
the rate allocation problem in DIBR coding with an arbitrary
number of reference and virtual views and without
rendering at the encoder side. Inspired by [28, 29, 30], we
first simplify different aspects of a multiview coder,
keeping only the main features. In particular, we make
simple models for the depth and texture coders, the camera
setup and the scene under observation. Then, using a
rate-distortion framework, a DR function is calculated and
eventually used for optimizing the allocation problem
in multiview coding. An important property of our
allocation method is that we do not consider the inpainting
step for virtual view synthesis at the decoder.
There are two reasons for this choice: first, we want to 
design an allocation strategy that is independent of the 
actual inpainting method; second, we focus on the ef- 
fect of view projections, which is mostly related to the 
geometry of the scene. Experimental results show that 
our model-based rate allocation method is efficient for 
different system configurations. The approach proposed
in this paper has low complexity but provides a distortion
that is not far from optimum, and in particular it
outperforms a priori rate allocation strategies that are
commonly used in practice.

Fig. 1 A DIBR multiview coder structure with p reference cameras and q equally spaced virtual views between each two reference views.

The organization of this paper is as follows. The next
section clarifies some notations, the camera and scene models
and the rate-distortion framework that is used in Section
3 for the calculation of our allocation model. Section 4
addresses a few optimization issues. Finally, Section 5
includes the details of our experimental results, parameter
values and comparison to other allocation strategies.

2 Framework and model 

In this section we define a few preliminary concepts 
that are used in our rate distortion study. Our main 
focus is the problem of distributing the available bit 
rate between several reference views and depth maps 
in a DIBR multiview video application, such that the 
distortion over all reference and rendered views at the 
decoder is minimized. In particular we are interested in
constructing a rate-distortion model for rate allocation
without explicit view synthesis at the encoder.

We first construct a rate-distortion model for a typical
wavelet-based texture coder and a simple quantization-based
depth map coder, along with a simple model of
scene. We present below some general notations and the 
wavelet framework. Then we describe our rate-distortion 
analysis framework, our model of the scene and of the 
camera. 

2.1 Notation 

Let $\mathbb{R}$ be the set of real numbers. The $L_2$-norm of a
function $f : \mathbb{R}^2 \to \mathbb{R}$ is defined as
$\|f\|_2 = \left(\iint f^2(t_1,t_2)\,dt_1\,dt_2\right)^{1/2}$.
Then, $L_2(\mathbb{R}^2)$ is the set of all functions $f : \mathbb{R}^2 \to \mathbb{R}$
with a finite $L_2$-norm. The angle bracket represents the
inner product of two functions in this space, i.e., for
$f, g \in L_2(\mathbb{R}^2)$ we have

$$\langle f, g \rangle = \iint f(t_1,t_2)\,g(t_1,t_2)\,dt_1\,dt_2 .$$

Then, let $\phi : \mathbb{R} \to \mathbb{R}$ and $\psi : \mathbb{R} \to \mathbb{R}$ be the
univariate scaling and wavelet functions of an orthonormal
wavelet transform, respectively [31]. The shifted and
scaled forms of these functions are denoted by
$\psi_{s,n}(t) = 2^{s/2}\psi(2^s t - n)$ and $\phi_{s,n}(t) = 2^{s/2}\phi(2^s t - n)$, where
$s, n \in \mathbb{Z}$ are respectively the scaling and shifting
parameters and $\mathbb{Z}$ is the set of integer numbers. The most
standard construction of two-dimensional wavelets relies
on a separable design that uses

$$\Psi^1_{s,n_1,n_2}(t_1,t_2) = \phi_{s,n_1}(t_1)\,\psi_{s,n_2}(t_2), \quad
\Psi^2_{s,n_1,n_2}(t_1,t_2) = \psi_{s,n_1}(t_1)\,\phi_{s,n_2}(t_2), \quad
\Psi^3_{s,n_1,n_2}(t_1,t_2) = \psi_{s,n_1}(t_1)\,\psi_{s,n_2}(t_2)$$

as the bases. It is proved in [31] that separable wavelets provide an
orthonormal basis for $L_2(\mathbb{R}^2)$. Therefore, any function
$f \in L_2(\mathbb{R}^2)$ can be written as

$$f(t_1,t_2) = \sum_{s,n_1,n_2} \sum_{i=1}^{3} C^i_{s,n_1,n_2}\,\Psi^i_{s,n_1,n_2}(t_1,t_2),$$

where, for every $s, n_1, n_2 \in \mathbb{Z}$,

$$C^i_{s,n_1,n_2} = \langle f, \Psi^i_{s,n_1,n_2} \rangle, \quad i = 1, 2, 3.$$

Practically, the wavelet transform defines a scale $s_0$
as the largest scale value. If we call $C^i_{s,n_1,n_2}$ high
frequency bands, at $s_0$ we thus have only one low
frequency band $\langle f, \Phi_{s_0,n_1,n_2} \rangle$, where
$\Phi_{s_0,n_1,n_2}(t_1,t_2) = \phi_{s_0,n_1}(t_1)\,\phi_{s_0,n_2}(t_2)$.
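The separable construction above can be illustrated with a minimal Python sketch. It uses the Haar wavelet purely as an assumed example (the paper does not name a specific wavelet) and performs one decomposition level, giving one low frequency band and three high frequency bands. Because the basis is orthonormal, the energy of the coefficients equals the energy of the image, which is the property the later analysis relies on. All function names are hypothetical.

```python
import math
import random


def haar_1d(x):
    """One level of the orthonormal Haar transform of an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    half = len(x) // 2
    low = [s * (x[2 * k] + x[2 * k + 1]) for k in range(half)]
    high = [s * (x[2 * k] - x[2 * k + 1]) for k in range(half)]
    return low, high


def transpose(block):
    return [list(col) for col in zip(*block)]


def haar_rows(block):
    """Apply haar_1d to every row; return the low and high halves."""
    pairs = [haar_1d(row) for row in block]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def haar_2d(img):
    """One level of the separable 2D transform: one low band, three high bands."""
    row_low, row_high = haar_rows(img)
    # Filter the columns by transposing, filtering rows, and transposing back.
    ll, lh = (transpose(b) for b in haar_rows(transpose(row_low)))
    hl, hh = (transpose(b) for b in haar_rows(transpose(row_high)))
    return ll, lh, hl, hh


def energy(block):
    """Sum of squared samples of a 2D block."""
    return sum(v * v for row in block for v in row)


# Small demonstration on a random 8 x 8 image (arbitrary test data).
random.seed(0)
img = [[random.random() for _ in range(8)] for _ in range(8)]
bands = haar_2d(img)
print(energy(img), sum(energy(b) for b in bands))
```

The two printed energies agree up to floating-point rounding, which is why distortion can be measured on wavelet coefficients instead of pixels.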

2.2 Scene and camera configuration model 

We use a very simple model of the scene in our analysis:
we consider foreground objects with arbitrary shapes
and flat surfaces on a flat background¹. Additionally,
even though a real scene is 3D, our model is a collection
of 2D images as we consider projections of the 3D scene
into the cameras' 2D coordinates.

Let $\mathcal{H}^Q(\Omega)$ be the space of 2D functions, $f : \mathbb{R}^2 \to \mathbb{R}$,
on the interval $[0,1]^2 \subset \mathbb{R}^2$, where $Q$ is the number
of foreground objects and $\Omega = \{\Omega_i : i = 0, \dots, Q-1\}$
denotes the foreground objects. We define $f \in \mathcal{H}^Q(\Omega)$ as

$$f(t_1,t_2) = \begin{cases} 1 & \text{if } \exists i : (t_1,t_2) \in \Omega_i \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

1 The extension of our analysis to scenes with $C^\alpha$ regular
surfaces is straightforward.

Fig. 2 A sample function in $\mathcal{H}(\Omega)$.


Our RD analysis is performed on $\mathcal{H}^1(\Omega)$ where $\Omega = \{\Omega_0\}$.
The extension to multiple foreground objects follows
naturally. For the sake of clarity, we skip the superscript
notation and represent this class by $\mathcal{H}(\Omega)$.
Figure 2 shows a sample function from $\mathcal{H}(\Omega)$. This figure
shows one arbitrarily shaped foreground object on a flat
background as it is projected into a 2D camera plane.
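As a rough illustration of this scene class, the following Python sketch samples a single foreground object on a flat background over the unit square. The disc shape, the grid size and the amplitudes (1 for foreground, 0 for background) are assumptions made for the example; the model allows arbitrary shapes. The helper `boundary_pixels` is hypothetical and only shows that the boundary, which carries the non-zero high frequency coefficients, grows with the sampling resolution.

```python
def sample_scene(n, center=(0.5, 0.5), radius=0.25):
    """Sample the indicator function of a disc on an n x n grid over [0, 1]^2."""
    cx, cy = center
    img = []
    for r in range(n):
        t2 = (r + 0.5) / n  # vertical coordinate of the pixel center
        row = []
        for c in range(n):
            t1 = (c + 0.5) / n  # horizontal coordinate of the pixel center
            inside = (t1 - cx) ** 2 + (t2 - cy) ** 2 <= radius ** 2
            row.append(1.0 if inside else 0.0)
        img.append(row)
    return img


def boundary_pixels(img):
    """Count foreground pixels that have at least one background 4-neighbour."""
    n = len(img)
    count = 0
    for r in range(n):
        for c in range(n):
            if img[r][c] != 1.0:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(0 <= i < n and 0 <= j < n and img[i][j] == 0.0
                   for i, j in neighbours):
                count += 1
    return count


scene = sample_scene(64)
print(boundary_pixels(scene), boundary_pixels(sample_scene(128)))
```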

In addition to our simple scene model, we now describe
our camera configuration model. Let us denote as
$B_q^p(\mathcal{P})$ a configuration with p reference cameras and
q equally spaced intermediate views between each two
reference views. Then, $\mathcal{P}$ is the set of intrinsic and
extrinsic parameters for reference and virtual cameras. It
is defined as $\mathcal{P} = \{(A_i, R_i, T_i) : i = 0, \dots, p-1\} \cup \{(A'_j, R'_j, T'_j) : j = 0, \dots, (p-1)q\}$, where $A_i$ and $R_i$
are respectively the intrinsic and rotation matrices of the
ith reference camera and $T_i$ is its corresponding translation
vector. The same parameters for virtual cameras
are given by $A'_j$, $R'_j$ and $T'_j$. Figure 1 shows a multiview
coder that corresponds to a $B_q^p(\mathcal{P})$ configuration.
In this paper, we consider that a texture image and a
depth map are coded and sent to the decoder for
each reference view. In our camera configuration $B_q^p(\mathcal{P})$,
we have p pairs of texture images and depth maps to be
coded at each time slot. The number of coded views is
given by system design criteria or rate-distortion constraints [26].



2.3 Rate-distortion framework 

Let us define three classes of signals $\mathcal{T} \subset L_2(\mathbb{R}^2)$, $\mathcal{V} \subset L_2(\mathbb{R}^2)$
and $\mathcal{D} \subset L_2(\mathbb{R}^2)$ as reference images, virtual
views and depth maps, respectively. Then, define $\mathcal{F}$
as the class of all $f = \{(t_i, d_i) : t_i \in \mathcal{T}, d_i \in \mathcal{D}, i = 0, \dots, p-1\}$
and similarly, $\mathcal{G}$ as the class of all
$g = \{(t_i, v_j) : t_i \in \mathcal{T}, v_j \in \mathcal{V}, i = 0, \dots, p-1, j = 0, \dots, q-1\}$.
Here, $\mathcal{F}$ represents all the coded data and $\mathcal{G}$ indicates
the set of all reference and virtual views that are
reconstructed at the decoder.

A typical multiview video coding strategy consists
of at least three building blocks, namely encoder, decoder
and rendering algorithm. Consider a texture encoding
scheme $\mathcal{E}_\mathcal{T} : \mathcal{T} \to \{1, 2, \dots, 2^{R_\mathcal{T}}\}$ and similarly
a depth encoding scheme $\mathcal{E}_\mathcal{D} : \mathcal{D} \to \{1, 2, \dots, 2^{R_\mathcal{D}}\}$,
where $R_\mathcal{T} = \sum_{i=0}^{p-1} R_{t_i}$ and $R_\mathcal{D} = \sum_{i=0}^{p-1} R_{d_i}$ are the
total number of bits allocated to texture and depth
information, respectively. This represents a total rate
$R = R_\mathcal{T} + R_\mathcal{D}$ bits at the encoder. Correspondingly, we call
the texture and depth decoders $\Gamma_\mathcal{T} : \{1, 2, \dots, 2^{R_\mathcal{T}}\} \to \mathcal{T}$
and $\Gamma_\mathcal{D} : \{1, 2, \dots, 2^{R_\mathcal{D}}\} \to \mathcal{D}$. Finally, we denote the
rendering scheme as $\Upsilon : \mathcal{F} \to \mathcal{G}$. Each rendering scheme
has two parts: first, the projection into the intermediate
view using a few close reference views and their
associated depth maps and second, filling the holes that
are not covered by any of these reference views. In this
paper we use only the two closest reference views
for rendering. Furthermore, we assume in our theoretical
analysis that we have no holes in the reconstructed
images. Thus, rendering becomes a simple projection
of the closest reference views onto an intermediate view
using depth information.

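Let us denote the decoded data as $\hat{f} = \Gamma(\mathcal{E}(f))$.
The distortion between the rendered version of the decoded data,
$\hat{g} = \Upsilon(\hat{f})$, and the original version, $g = \Upsilon(f)$, is given by²

$$D(g,\hat{g}) = \sum_{i=0}^{p-1} \|t_i - \hat{t}_i\|_2^2 + \sum_{j=0}^{q-1} \|v_j - \hat{v}_j\|_2^2 . \tag{2}$$

We finally define the distortion of the coding scheme
as the distortion of the encoding algorithm in the least
favorable case, i.e.,

$$D_{\mathcal{E},\Gamma,\Upsilon}(R) = \sup_{g \in \mathcal{G}} D(g,\hat{g}). \tag{3}$$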

When the encoding, decoding and rendering strategies 
are clear from the context we use a simpler notation 
D(R) and call it the distortion-rate (DR) function. 
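The distortion measure of this section can be sketched in a few lines of Python. The sketch assumes that views are stored as lists of rows and that the squared $L_2$ norm is used; the supremum of the worst-case definition is replaced by a maximum over a finite, hypothetical set of test scenes.

```python
def squared_l2(a, b):
    """Squared L2 distance between two equally sized 2D arrays (lists of rows)."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rendered_distortion(refs, refs_hat, virts, virts_hat):
    """Sum of reference-view and virtual-view errors, in the spirit of Eq. (2)."""
    d_ref = sum(squared_l2(t, th) for t, th in zip(refs, refs_hat))
    d_virt = sum(squared_l2(v, vh) for v, vh in zip(virts, virts_hat))
    return d_ref + d_virt


def worst_case_distortion(scenes):
    """Maximum over a finite set of scenes, mirroring the supremum of Eq. (3)."""
    return max(rendered_distortion(*scene) for scene in scenes)


# Toy data: one reference view and one virtual view per scene.
scene_a = ([[[1, 2], [3, 4]]], [[[1, 2], [3, 5]]], [[[0, 0]]], [[[1, 1]]])
scene_b = ([[[1, 2], [3, 4]]], [[[1, 2], [3, 4]]], [[[0, 0]]], [[[0, 1]]])
print(rendered_distortion(*scene_a), worst_case_distortion([scene_a, scene_b]))
```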

3 Theoretical analysis 

In this section we propose a DR function based on our
simple model of scenes. We first consider a simple
camera configuration $B_1^1(\mathcal{P})$ with only one reference
view and one virtual view. Then we extend the analysis
to more virtual views with camera configuration
$B_q^1(\mathcal{P})$ and to more reference views with configuration
$B_q^p(\mathcal{P})$. For each class of functions the RD analysis is
built in the wavelet domain where the distortion is the
distance between the original and coded wavelet coefficients.
The distortion in the wavelet domain is equal to
the distortion in the signal domain when wavelets form
an orthonormal basis, while such a sparse representation
of our virtual and reference views simplifies the
RD analysis. Assuming that coding has a negligible effect
on the average signal value, we can ignore the
distortion in the lowest frequency band. Therefore, in
the following analysis we only focus on the distortion in
high frequency band coefficients. In all the proofs, we
assume that the wavelets have a finite support of length
$\ell$ and that their first moments are equal to zero.

2 In this paper we consider the $\ell_2$ distortion. However,
the extension to other norm losses is straightforward.



Theorem 1 The coding scheme that uses a wavelet-based
texture coder and a uniform quantization depth coder
achieves the following DR function on scene configuration
$\mathcal{H}^1(\Omega)$ and camera setup $B_1^1(\mathcal{P})$:

$$D(R_t, R_d) \sim O\left(2\mu\sigma^2 2^{-\alpha R_t} + K\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_d} + \Delta Z]}\right)$$

where $R_t$ and $R_d$ are the texture and depth bit rates,
$K = A'R'|T - T'|$ depends on camera parameters, $\Delta Z = Z_{max} - Z_{min}$,
$Z_{max}$ and $Z_{min}$ are the maximum and
minimum depth values in the scene, $\sigma^2$ is the reference
frame variance and $\mu$, $\alpha$ and $\beta$ are positive constants.
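To make the shape of this DR function concrete, the following Python sketch evaluates the expression inside the $O(\cdot)$ for one reference view and one virtual view. All numerical values in `PARAMS` are hypothetical placeholders chosen for illustration, and the camera-dependent factor is passed as a single scalar `k`.

```python
def texture_term(rate_t, mu, alpha, sigma2):
    """Texture distortion model mu * sigma^2 * 2^(-alpha * R_t)."""
    return mu * sigma2 * 2.0 ** (-alpha * rate_t)


def depth_term(rate_d, k, z_min, z_max, beta):
    """Depth-induced distortion K * dZ / (Z_min * (2^(beta * R_d) + dZ))."""
    dz = z_max - z_min
    return k * dz / (z_min * (2.0 ** (beta * rate_d) + dz))


def dr_single_view(rate_t, rate_d, *, mu, alpha, beta, sigma2, k, z_min, z_max):
    """Model distortion for one reference view and one virtual view."""
    return (2.0 * texture_term(rate_t, mu, alpha, sigma2)
            + depth_term(rate_d, k, z_min, z_max, beta))


# Hypothetical parameter values, for illustration only.
PARAMS = dict(mu=0.9, alpha=20.5, beta=8.5, sigma2=1500.0,
              k=100.0, z_min=42.0, z_max=130.0)

for rt, rd in [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2)]:
    print(rt, rd, dr_single_view(rt, rd, **PARAMS))
```

Under these assumptions the model distortion decreases when either rate increases, which is the behavior the allocation problem exploits.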

Proof For the camera configuration $B_1^1(\mathcal{P})$ we have
$g = \{(t_0, v_0)\}$, with one reference view and one virtual view.
In all proofs we consider that there is no occluded region
for the sake of simplicity. Inspired by [32], we consider
the same quantization level for each wavelet coefficient.
This suboptimal choice of quantization will only
affect constant factors of the DR function and will not
change the final upper bound equation. In addition,
for depth map coding, we assume a quantization-based
coder that simply splits the depth image into uniform
square areas; for each square the average depth is
quantized and coded. Therefore, if we assign b bits for
coding each wavelet coefficient in the reference frame
and b' bits for coding each depth value, there will be
three sources of distortion after decoding and rendering
at the decoder side.

First, at every scale s the number of non-zero wavelet
coefficients is $3 \times \partial\Omega\,\ell\,2^s$, where $\partial\Omega$ is the boundary
length of $\Omega$ in $v_0$ and the factor 3 is due to the three wavelet
bands. Using the definitions of Section 2.1, the magnitude
of coefficients at scale s of a standard wavelet
decomposition is bounded by

$$|C^1_{s,n_1,n_2}| \le \iint |f(t_1,t_2)|\,|\Psi^1_{s,n_1,n_2}(t_1,t_2)|\,dt_1\,dt_2
\le 2^s \iint |\phi(2^s t_1 - n_1)\,\psi(2^s t_2 - n_2)|\,dt_1\,dt_2 \le 2^{-s}.$$

We have similar results in case of $|C^2_{s,n_1,n_2}|$ and $|C^3_{s,n_1,n_2}|$.
By assigning b bits for coding each coefficient, clearly all
the coefficients at scale s, with $2^{-s} < 2^{-b-1}$, will be mapped
to zero. Therefore, the first source of coding distortion
$D_1$ is

$$D_1 = 12\,\ell\,\partial\Omega \sum_{s=b+2}^{\infty} 2^s \times (2^{-s})^2 = c_1 2^{-b-1} \tag{4}$$

where $c_1 = 12\,\ell\,\partial\Omega$. Note that a factor of 2 is added here
because the error of skipping small wavelet coefficients
affects the distortion in both $t_0$ and $v_0$ similarly.
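The closed form of the geometric sum behind the first distortion term can be checked numerically. The short Python sketch below sums $2^s (2^{-s})^2 = 2^{-s}$ from $s = b+2$ over a finite number of scales (the truncation length `terms` is an arbitrary choice) and compares the result with $2^{-b-1}$.

```python
def skipped_energy(b, terms=60):
    """Partial sum of 2^s * (2^-s)^2 for s = b+2, ..., b+1+terms."""
    return sum(2.0 ** s * (2.0 ** -s) ** 2 for s in range(b + 2, b + 2 + terms))


for b in (2, 5, 8):
    # The two printed values agree up to the truncated tail of the series.
    print(b, skipped_energy(b), 2.0 ** (-b - 1))
```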

Then, depth map quantization also introduces distortion
as it leads to shifts in foreground objects. Recall
that we are calculating distortion in the wavelet
domain. Consider $s_1$ as the largest scale with a wavelet
support length that is smaller than the amount of shift
in the foreground object. Non-zero wavelet coefficients at
scales larger than or equal to $s_1$ suffer from position changes
due to depth coding. Let $\Delta_0$ be the maximum position
error in $v_0$ with a b'-bit quantization-based depth
coder. Then we have $\ell 2^{-s_1-1} < \Delta_0 < \ell 2^{-s_1}$. Hence,
our second source of error, $D_2$, is

$$D_2 = 2 \times 3\,\ell\,\partial\Omega \sum_{s=s_1+1}^{b+1} 2^s (2^{-s})^2 = c_1 (2^{-s_1} - 2^{-b-1}). \tag{5}$$

Here the factor 2 is due to the shift of significant coefficients.

Finally, distortion is generated by the quantization of
non-zero coefficients. Using the definitions of b and $s_1$,
for the reference frame, $t_0$, we have quantization error on large
coefficients at scales $s \le b+1$ and for the virtual view,
$v_0$, this happens at $s \le s_1$. Thus, for the last source of
distortion we have

$$D_3 = 3\,\ell\,\partial\Omega \left[\sum_{s=1}^{b+1} 2^s (2^{-b-1})^2 + \sum_{s=1}^{s_1} 2^s (2^{-b-1})^2\right] \le c_1 (2^{-b} + 2^{s_1} 2^{-2b}). \tag{6}$$

Using (4), (5) and (6) the total distortion is

$$D \le c_1 [2^{-s_1} + 2^{-b} + 2^{s_1} 2^{-2b}].$$

From the definitions of $s_1$ and $\Delta_0$ we have $s_1 < b$ and
$s_1 > \log \Delta_0^{-1} - 1$. Therefore, we can simplify the above
equation as

$$D = O(2^{-b} + \Delta_0).$$

The first term only depends on texture coding errors
and the second term on depth quantization. We replace
the texture coding term with a simple distortion model
$\mu\sigma^2 2^{-\alpha R}$ [33] where $\mu$ and $\alpha$ are model parameters, $\sigma^2$
is the source variance and R is the target bit rate. Using
the formulation of maximum shift error in [21] for the
depth distortion term we finally have

$$D(R_t, R_d) = O\left(2\mu\sigma^2 2^{-\alpha R_t} + A'R'|T - T'|\,\frac{Z_{max} - Z_{min}}{Z_{min}\,[2^{\beta R_d} + Z_{max} - Z_{min}]}\right) \tag{7}$$

where $\beta$ is another model parameter that depends on the
depth coding method.

We now extend the above analysis to more complex
camera configurations. We first consider q virtual views
in a $B_q^1(\mathcal{P})$ configuration.

Theorem 2 The coding scheme that uses a wavelet-based
texture coder and a uniform quantization depth coder
achieves the following DR function on scene configuration
$\mathcal{H}^1(\Omega)$ and camera setup $B_q^1(\mathcal{P})$:

$$D(R_t, R_d) \sim O\left((q+1)\,\mu\sigma^2 2^{-\alpha R_t} + \sum_{j=0}^{q-1} K_j\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_d} + \Delta Z]}\right)$$

where $R_t$ and $R_d$ are the texture and depth rates,
$K_j = A'_j R'_j |T - T'_j|$ for $j = 0, \dots, q-1$ depends on camera
parameters, $\Delta Z = Z_{max} - Z_{min}$, $Z_{max}$ and $Z_{min}$ are
the maximum and minimum depth values in the scene,
$\sigma^2$ is the reference frame variance and $\mu$, $\alpha$ and $\beta$ are
positive constants.
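The extension to q virtual views can be sketched in Python as follows. The function takes one camera-dependent factor per virtual view in the list `ks`; these factors and the parameter values used in the demonstration are hypothetical. With a single virtual view the expression reduces to the one-view model of Theorem 1.

```python
def dr_multi_view(rate_t, rate_d, ks, *, mu, alpha, beta, sigma2, z_min, z_max):
    """Model distortion for one reference view and len(ks) virtual views."""
    q = len(ks)
    dz = z_max - z_min
    # The texture error is counted once for the reference view and once per virtual view.
    texture = (q + 1) * mu * sigma2 * 2.0 ** (-alpha * rate_t)
    # The depth-induced shift error is scaled by the camera factor of each virtual view.
    depth = sum(k * dz / (z_min * (2.0 ** (beta * rate_d) + dz)) for k in ks)
    return texture + depth


# Hypothetical parameter values, for illustration only.
MODEL = dict(mu=0.9, alpha=20.5, beta=8.5, sigma2=1500.0, z_min=42.0, z_max=130.0)

one_view = dr_multi_view(0.1, 0.1, [100.0], **MODEL)
three_views = dr_multi_view(0.1, 0.1, [100.0, 200.0, 300.0], **MODEL)
print(one_view, three_views)
```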

Proof With q virtual cameras the three sources of distortion
in the proof of Theorem 1 turn into

$$D_1 = c_1 (q+1)\,2^{-b-1}, \tag{8}$$

$$D_2 = 2 \times 3\,\ell\,\partial\Omega \sum_{j=0}^{q-1} \sum_{s=s_j+1}^{b+1} 2^s (2^{-s})^2 = c_1 \left(\sum_{j=0}^{q-1} 2^{-s_j} - q\,2^{-b-1}\right) \tag{9}$$

and

$$D_3 \le c_1 \left(2^{-b} + 2^{-2b} \sum_{j=0}^{q-1} 2^{s_j}\right). \tag{10}$$

We have $s_j < b$ and $s_j > \log \Delta_j^{-1} - 1$ for $j = 0, \dots, q-1$,
thus

$$D = O\left((q+1)\,2^{-b} + \sum_{j=0}^{q-1} \Delta_j\right).$$

Here, we have simply used the fact that the error
in the virtual views grows with the number of such
views. The DR function is then obtained by following
exactly the same replacements as in the proof of
Theorem 1.

Finally we extend the analysis to configurations with
more reference views. We assume that we have equally
spaced reference cameras and virtual views, and that
the number of intermediate views is uniform between
every two reference cameras. A weighted interpolation
strategy using the two closest reference views is employed
for synthesis at each virtual view point. The
weights are related to the distances between the corresponding
virtual view and its right and left reference views,
similarly to [19]. Theorem 3 provides the general DR function
for a general camera configuration with p reference
views and (p − 1)q virtual views.



Theorem 3 The coding scheme that uses a wavelet-based
texture coder and a uniform quantization depth coder
achieves the following DR function on scene configuration
$\mathcal{H}^1(\Omega)$ and camera setup $B_q^p(\mathcal{P})$:

$$D(R_{t_0}, \dots, R_{t_{p-1}}, R_{d_0}, \dots, R_{d_{p-1}}) \sim O\Bigg(\sum_{i=0}^{p-1} \mu\sigma_i^2 2^{-\alpha R_{t_i}}
+ \sum_{j=0}^{(p-1)q} \Bigg[\frac{d_{j,l}}{d}\Bigg(\mu\sigma_r^2 2^{-\alpha R_{t_r}} + K_{j,r}\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_{d_r}} + \Delta Z]}\Bigg)
+ \frac{d_{j,r}}{d}\Bigg(\mu\sigma_l^2 2^{-\alpha R_{t_l}} + K_{j,l}\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_{d_l}} + \Delta Z]}\Bigg)\Bigg]\Bigg),$$

where $R_{t_i}$ and $R_{d_i}$ are the texture and depth rates for
the ith reference view, $\Delta Z = Z_{max} - Z_{min}$, $Z_{max}$ and
$Z_{min}$ are the maximum and minimum depth values in
the scene, $\sigma_i^2$ is the variance of the ith reference view and
$\mu$, $\alpha$ and $\beta$ are positive constants. Also, d indicates the
distance between each two reference cameras and $d_{j,l}$
and $d_{j,r}$ are the distances between the jth virtual view and
its left and right reference cameras, indexed by l and r. Similarly, we have
$K_{j,l} = A'_j R'_j |T_l - T'_j|$ and $K_{j,r} = A'_j R'_j |T_r - T'_j|$, which
depend on the camera parameters.



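Proof First, using Theorem 2 we can write the distortion
of a reference view, r, and the q virtual views on
its left as

$$D(R_{t_r}, R_{d_r}) = O\Bigg(\mu\sigma_r^2 2^{-\alpha R_{t_r}} + \sum_{j=0}^{q-1} \Bigg[\mu\sigma_r^2 2^{-\alpha R_{t_r}} + K_{j,r}\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_{d_r}} + \Delta Z]}\Bigg]\Bigg). \tag{11}$$

Clearly, the first and second terms define the distortion
at reference and virtual views, respectively. By adding
another reference view, l, and using a weighted average
of the two closest reference views for synthesizing
virtual views we have

$$D(R_{t_r}, R_{t_l}, R_{d_r}, R_{d_l}) = O\Bigg(\mu\sigma_r^2 2^{-\alpha R_{t_r}} + \mu\sigma_l^2 2^{-\alpha R_{t_l}}
+ \sum_{j=0}^{q-1} \Bigg[\frac{d_{j,l}}{d}\Bigg(\mu\sigma_r^2 2^{-\alpha R_{t_r}} + K_{j,r}\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_{d_r}} + \Delta Z]}\Bigg)
+ \frac{d_{j,r}}{d}\Bigg(\mu\sigma_l^2 2^{-\alpha R_{t_l}} + K_{j,l}\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_{d_l}} + \Delta Z]}\Bigg)\Bigg]\Bigg) \tag{12}$$

where d indicates the distance between the two reference
cameras and $d_{j,l}$ and $d_{j,r}$ are the distances between
the jth virtual view and its left and right reference
cameras. Here, our weights are simply related to the
distance between the virtual view and its neighboring reference
views. Finally, summing up the terms of (12) for
all reference views leads to the distortion in Theorem 3.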

The above rate-distortion analysis is performed on
$\mathcal{H}^1(\Omega)$. However, the extension to multiple foreground
objects is straightforward and only adds constant factors
to the RD function.



4 RD Optimization 

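In this section we show how the analysis in Section 3
can be used for optimizing the rate allocation in multiview
video coding. Using Theorem 3, the rate allocation
problem turns into the following convex nonlinear multivariable
optimization problem with linear constraints:

$$\arg\min_{\vec{R}_t, \vec{R}_d}\; g_t(\vec{R}_t) + g_d(\vec{R}_d) \quad \text{such that} \quad \|\vec{R}_t\|_1 + \|\vec{R}_d\|_1 \le R \tag{13}$$

where

$$g_t(\vec{R}_t) = \sum_{i=0}^{p-1} \mu\sigma_i^2 2^{-\alpha R_{t_i}} + \sum_{j=0}^{(p-1)q} \Bigg[\frac{d_{j,l}}{d}\,\mu\sigma_r^2 2^{-\alpha R_{t_r}} + \frac{d_{j,r}}{d}\,\mu\sigma_l^2 2^{-\alpha R_{t_l}}\Bigg],$$

$$g_d(\vec{R}_d) = \sum_{j=0}^{(p-1)q} \Bigg[\frac{d_{j,l}}{d}\,K_{j,r}\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_{d_r}} + \Delta Z]} + \frac{d_{j,r}}{d}\,K_{j,l}\,\frac{\Delta Z}{Z_{min}\,[2^{\beta R_{d_l}} + \Delta Z]}\Bigg],$$

and R is the total target bit rate. The convexity proof
is straightforward since the above optimization problem
is a sum of terms of the form $a2^{-bx}$, which are convex.
Therefore it can be solved efficiently using classical convex
optimization tools. Note that the above optimization
problem is for the general camera configuration
$B_q^p(\mathcal{P})$. The rate allocation for simpler configurations
is straightforward by replacing the objective functions
with terms from Theorems 1 and 2. We can finally note
that the rate allocation strategy is only based on
encoder-side data.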
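For the simplest configuration, with one texture rate and one depth rate, the allocation can be sketched in Python as a one-dimensional search along the budget line $R_t + R_d = R$. The sketch below uses a plain grid search, which is a swapped-in technique chosen because it does not rely on any property of the objective; a convex solver could be used instead. The function `example_model` and its parameter values are hypothetical.

```python
def allocate(total_rate, distortion, steps=1000):
    """Return (R_t, R_d, D) minimizing distortion(R_t, R_d) with R_t + R_d = total_rate."""
    best = None
    for i in range(steps + 1):
        rate_t = total_rate * i / steps
        rate_d = total_rate - rate_t
        d = distortion(rate_t, rate_d)
        if best is None or d < best[2]:
            best = (rate_t, rate_d, d)
    return best


def example_model(rate_t, rate_d):
    """One-reference, one-virtual-view model with hypothetical parameter values."""
    mu, alpha, beta, sigma2 = 0.9, 20.5, 8.5, 1500.0
    k, z_min, z_max = 100.0, 42.0, 130.0
    dz = z_max - z_min
    texture = 2.0 * mu * sigma2 * 2.0 ** (-alpha * rate_t)
    depth = k * dz / (z_min * (2.0 ** (beta * rate_d) + dz))
    return texture + depth


allocation = allocate(0.24, example_model)
print("R_t = %.4f bpp, R_d = %.4f bpp, model distortion = %.4f" % allocation)
```

Only the model and encoder-side quantities enter the search, so no view has to be rendered to obtain the split.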

The last issue that we have to address is the adjustment
of the model parameters. There are three parameters, $\mu$,
$\alpha$ and $\beta$, in (13) that we estimate using the following
offline method. Using the first texture and depth images,
we estimate the model parameters by solving the following
regression

$$\min_{\mu,\alpha,\beta} \sum_{k=1}^{n} |D(R_k) - D^*(R_k)| \tag{14}$$

where n is the number of points in the regression, which is
further discussed in the next section, $D(R_k)$ is the
distortion obtained by our rate allocation strategy of
Eq. (13) with target bit rate $R_k$, and $D^*(R_k)$ is the distortion
of the best possible allocation obtained by a full search method at
the same bit rate.
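The offline regression can be sketched in Python under several assumptions: the parameter triple is chosen from a finite list of candidates, the budget line is discretized, and the actual coder is represented by a hypothetical callable `measure(rate_t, rate_d)` that returns the measured distortion. In the demonstration a toy model with known parameters stands in for the real coder, so the numbers carry no experimental meaning.

```python
import itertools


def split_grid(total, steps=50):
    """All (R_t, R_d) pairs on a discretized budget line R_t + R_d = total."""
    return [(total * i / steps, total - total * i / steps) for i in range(steps + 1)]


def regression_cost(params, rates, model, measure, steps=50):
    """Sum over target rates of |D(R_k) - D*(R_k)|, as in Eq. (14)."""
    cost = 0.0
    for total in rates:
        grid = split_grid(total, steps)
        # Allocation suggested by the model with the candidate parameters.
        rate_t, rate_d = min(grid, key=lambda s: model(s[0], s[1], params))
        d_model = measure(rate_t, rate_d)
        # Best measured distortion on the same grid (full search).
        d_best = min(measure(a, b) for a, b in grid)
        cost += abs(d_model - d_best)
    return cost


def fit_parameters(candidates, rates, model, measure):
    """Return the candidate parameter triple with the smallest regression cost."""
    return min(candidates, key=lambda p: regression_cost(p, rates, model, measure))


def toy_model(rate_t, rate_d, params):
    """Simplified stand-in for the DR model (hypothetical constants)."""
    mu, alpha, beta = params
    return 2.0 * mu * 2.0 ** (-alpha * rate_t) + 1.0 / (2.0 ** (beta * rate_d) + 1.0)


TRUE_PARAMS = (0.9, 20.0, 8.0)


def toy_measure(rate_t, rate_d):
    """Pretend measurement: the toy model evaluated with the true parameters."""
    return toy_model(rate_t, rate_d, TRUE_PARAMS)


CANDIDATES = list(itertools.product((0.5, 0.9), (10.0, 20.0), (4.0, 8.0)))
RATES = [0.1, 0.2, 0.3, 0.4]  # n = 4 target bit rates
best_params = fit_parameters(CANDIDATES, RATES, toy_model, toy_measure)
print(best_params, regression_cost(best_params, RATES, toy_model, toy_measure))
```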



5 Experimental Results 

In the previous sections, we have studied the bit allocation
problem on simple scenes and extracted a model
for estimating the RD function of a DIBR multiview coder
with wavelet-based texture coding and quantization-based
depth coding. This section studies the RD behavior
and the accuracy of the proposed model on real scenes
where JPEG2000 is used for coding depth and reference
images.

We use the Ballet and Breakdancers datasets from the
Interactive Visual Group of Microsoft Research [34]. In
our simulations gray-scale versions of these datasets are
used. These datasets contain 100 frames and all the numerical
results in this section are averages over
three frames from the beginning, middle and end of these
sequences, i.e., frames with temporal indices 0, 49 and
99. The camera intrinsic and extrinsic parameters, $\mathcal{P}$,
and the scene parameters, $Z_{min}$ and $Z_{max}$, are set to
the values given by the datasets. In cases where parameters
are changed to study the model under some special aspects,
we mention the parameter values explicitly.



In an offline stage using Eq. (14) we adjust the $\mu$, $\alpha$
and $\beta$ parameters in Eq. (13) at four bit rates, i.e.,
n = 4, for each dataset. The parameter values are set to
(0.9, 20.5, 8.5) for Ballet and (0.9, 30.0, 2.7) for Breakdancers.
These values are fixed throughout this section for
the different camera configurations.

In the following sections we study the RD model of
Eq. (13) for rate allocation in different camera configurations,
$B_1^1(\mathcal{P})$, $B_q^1(\mathcal{P})$ and $B_q^p(\mathcal{P})$. As a comparison
criterion we use the optimal allocation that is obtained
by rendering all the intermediate views and searching
the whole distortion-rate space for the allocation with
minimal distortion.

As we want to keep our model independent of any
special strategy for filling occluded regions, all occluded
regions are ignored in distortion and PSNR calculations.

5.1 $B_1^1(\mathcal{P})$ configuration

We start with the $B_1^1(\mathcal{P})$ camera setup, a simple configuration
with one reference view and only one virtual view.
As reference and target cameras, we use cameras
0 and 1 of the datasets, respectively. Thus, all camera-related
parameters in Eq. (13) are set accordingly.

A DR surface is first generated offline for the desired
bit rate range to generate the distortion benchmark values.
In our study, $R_t$ and $R_d$ are set between 0.02 and
0.5 bpp with 0.02 bpp steps. It means that the $R_t$ and $R_d$
axes are discretized into 25 values. Since the images are
gray-scale and we are coding only one reference view and
one depth map, this range of bit rates is reasonable.
The DR surface is generated by actually coding
the texture and depth images at each $(R_t, R_d)$ pair and
by calculating the distortion after decoding and synthesis.

Then, for each target bit rate R, the optimal rate
allocation is calculated by cutting the above surface with
the plane $R_t + R_d = R$ and minimizing the distortion. If
the minimum point occurs between grid points (because
we have a discretized surface), bicubic interpolation is
used to estimate the optimal allocation. Here, R is set
between 0.1 and 0.5 bpp with a 0.01 bpp step. Figure 3
provides distortion curves of the compression performance
of the DIBR coder in terms of PSNR for the Ballet and
Breakdancers datasets. The estimated curve is generated by
solving the optimization problem provided in Eq. (13)
with the proposed RD model. The final PSNR results
are averaged over frames 0, 49 and 99 of these datasets.
The average differences between the model-based and
optimal curves are 0.05 dB and 0.06 dB for the Ballet and
Breakdancers sets, respectively. Also, the maximum loss
in PSNR in our model-based rate allocation is 0.11 and
0.13 dB, respectively.
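The benchmark search described above can be sketched in Python as follows. The sketch assumes that the measured DR surface is stored as a square table indexed by the 25 discrete texture and depth rates, restricts the target rate to multiples of the 0.02 bpp grid step, and omits the bicubic interpolation used between grid points. The surface values are synthetic placeholders, not measured results, and the function names are hypothetical.

```python
import math

STEP = 0.02
RATES = [STEP * (k + 1) for k in range(25)]  # 0.02, 0.04, ..., 0.50 bpp


def best_on_budget(surface, budget_index):
    """Search the cells with i + j == budget_index, i.e. R_t + R_d = STEP * (budget_index + 2)."""
    n = len(surface)
    best = None
    for i in range(n):
        j = budget_index - i
        if 0 <= j < n and (best is None or surface[i][j] < best[2]):
            best = (i, j, surface[i][j])
    return best


def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)


# Synthetic surface standing in for distortions measured after decoding and synthesis.
surface = [[100.0 * 2.0 ** (-10.0 * rt) + 50.0 * 2.0 ** (-5.0 * rd) for rd in RATES]
           for rt in RATES]

i, j, mse = best_on_budget(surface, 10)  # total rate 0.24 bpp
print(RATES[i], RATES[j], psnr(mse))
```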

Table 1 shows the percentage of the total rate that is used for coding texture at
different target bit rates. Our model-based allocation closely follows the best
allocation. Figure 4 further shows the best and model-based allocations versus bit
rate in terms of $R_t$ percentage. Additionally, the two dotted curves are the upper
and lower bounds on the $R_t$ allocation within which the PSNR loss compared to the
best allocation remains below 0.2 dB.
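Such a tolerance band can be computed by scanning the texture share at a fixed total rate and keeping every share whose PSNR loss stays within the threshold. The sketch below assumes a hypothetical closed-form quality model (`toy_psnr`) in place of the measured DR surface, and assumes the quality is unimodal along the cut so that the acceptable shares form one interval.

```python
import math


def toy_psnr(r_t, r_d):
    """Hypothetical PSNR model standing in for the measured DR surface."""
    mse = 50.0 * 2.0 ** (-8.0 * r_t) + 30.0 * 2.0 ** (-8.0 * r_d)
    return 10.0 * math.log10(255.0 ** 2 / mse)


def tolerance_band(total, max_loss_db=0.2, steps=1000):
    """Return (low %, best %, high %) of R_t keeping the loss within max_loss_db."""
    shares = [k / steps for k in range(1, steps)]            # texture share of the budget
    quality = [toy_psnr(s * total, (1.0 - s) * total) for s in shares]
    best = max(quality)
    ok = [s for s, q in zip(shares, quality) if best - q <= max_loss_db]
    best_share = shares[quality.index(best)]
    return 100.0 * min(ok), 100.0 * best_share, 100.0 * max(ok)


low, best, high = tolerance_band(0.3)
print(f"R_t share within 0.2 dB of the best: {low:.1f}% to {high:.1f}% (best {best:.1f}%)")
```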

We now study the performance of a priori given rate allocations, which are commonly
adopted in practice. We consider several such allocations, where $R_t$ spans a range
of 20% to 80% of the total budget. Table 2 shows the average PSNR loss compared to
the best allocation in these cases. All these results are averaged over frames 0, 49
and 99 in both datasets. We compare them with the performance of the rate allocation
estimated with our RD model and show that, on average, our allocation is at least as
good as every a priori allocation. Figure 4 further shows that using a model-based
allocation instead of an a priori allocation is more important at low bit rates or
for scenes with objects close to the camera (like Ballet). Depending on the dataset,
the best a priori allocation occurs at different $R_t$ percentages. With our proposed
allocation, the results are close to optimal in both datasets, as the model adapts to
the scene content. The last two rows of Table 2 show the average benefit of our model
compared to a fixed rate allocation.
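The dependence of the best a priori allocation on the dataset can be read directly from the average losses reported in Table 2. The short script below only re-uses those reported values; the variable and function names are illustrative.

```python
SHARES = [20, 30, 40, 50, 60, 70, 80]   # a priori R_t percentage of the total rate

# Average PSNR loss (dB) with respect to the best allocation, copied from Table 2.
FIXED_LOSS = {
    "Ballet":       [1.43, 0.66, 0.29, 0.11, 0.07, 0.14, 0.35],
    "Breakdancers": [1.97, 1.14, 0.68, 0.40, 0.21, 0.10, 0.06],
    "Overall":      [1.70, 0.90, 0.49, 0.26, 0.14, 0.12, 0.21],
}
MODEL_LOSS = {"Ballet": 0.05, "Breakdancers": 0.06, "Overall": 0.06}


def best_fixed(dataset):
    """Return (share in %, average loss in dB) of the best a priori allocation."""
    losses = FIXED_LOSS[dataset]
    loss = min(losses)
    return SHARES[losses.index(loss)], loss


for name in FIXED_LOSS:
    share, loss = best_fixed(name)
    print(f"{name}: best fixed share {share}% loses {loss:.2f} dB, "
          f"model loses {MODEL_LOSS[name]:.2f} dB")
```

The best fixed share differs between the two sequences, so no single a priori choice is best for both, whereas the model-based loss does not exceed the best fixed loss in any row.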

Fig. 3 Comparison of coding performance for $B_1^1(V)$ using the proposed allocation
method and the best allocation in terms of PSNR at rates ranging from 0.1 to 0.5 bpp;
Ballet (left) and Breakdancers (right). Horizontal axis: total bit rate, R (bpp).

Table 1 Rate allocation results for $B_1^1(V)$ - Comparison between allocation with
the proposed model and the optimal allocation, in terms of $R_t$ percentage of the
total rate.

Total bit rate                0.2 bpp   0.3 bpp   0.4 bpp   0.5 bpp
Ballet        optimal         67.83%    57.78%    53.33%    37.33%
              model-based     54.61%    49.61%    48.46%    48.36%
Breakdancers  optimal         80.91%    75.56%    70.32%    73.33%
              model-based     75.72%    74.54%    75.09%    75.78%

Fig. 4 Rate allocation results of $B_1^1(V)$ using our proposed method and the
optimal allocation in terms of $R_t$ percentage of total rates ranging from 0.1 to
0.5 bpp; Ballet (left) and Breakdancers (right). The black dashed curves show the
bounds within which the difference in PSNR quality with the optimal allocation
remains less than or equal to 0.2 dB. Horizontal axis: total bit rate, R (bpp).

Finally, we study the effect of the distance of the virtual view on the rate
allocation. We vary the distance between the reference and virtual views from 1 to
20 cm by only changing the value of the x coordinate in the translation vector $T'$
of the virtual camera. We further fix the total bit rate to $R = 0.24$ bpp. Figure 5
shows the best rate allocation as a function of the distance of the virtual view.
Again, these results are averaged over frames 0, 49 and 99 of the Ballet and
Breakdancers datasets. Intuitively, for a given error in the depth maps due to
coding, the rendering distortion should be smaller in closer virtual views than in
farther ones. It means that rendering far views requires more accurate depth
information. Conversely, texture coding distortion plays a more important role in
closer views. This is shown in Figure 5, where the $R_t$ percentage decreases as the
distance of the virtual view increases. For the Ballet dataset, however, we observe
an increase in $R_t$ beyond 12 cm. This is due to the nature of the scene and to the
fact that we use only one camera for rendering virtual views. In this sequence there
are two foreground objects close to the camera and, beyond a given distance, they
move out of the view boundaries so that mostly background pixels remain. Depth coding
errors are less important for background regions that are far from the camera. We
also show in Figure 5 the model-based allocation using our RD equation in Eq. (13),
where we only change $T'$. Therefore, the second term of the distortion grows with
the distance, which means that increasing $R_d$ yields a smaller distortion than
increasing $R_t$. The average PSNR penalty of our model-based allocation is 0.05 dB
and 0.03 dB for Ballet and Breakdancers, respectively.
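The intuition above can be illustrated with the standard disparity relation for parallel pinhole cameras, where a point at depth Z is shifted by f·b/Z pixels between two views separated by a baseline b. This is a simplified geometric sketch and not the distortion model of Eq. (13); the focal length and the depth values below are hypothetical.

```python
def disparity_error(focal_px, baseline_m, depth_m, depth_error_m):
    """Pixel shift caused by a depth error, assuming parallel pinhole cameras."""
    true_disp = focal_px * baseline_m / depth_m
    coded_disp = focal_px * baseline_m / (depth_m + depth_error_m)
    return abs(true_disp - coded_disp)


FOCAL_PX = 1900.0        # hypothetical focal length in pixels
DEPTH_ERROR_M = 0.05     # hypothetical depth error introduced by coding

for baseline_cm in (1, 5, 10, 20):
    b = baseline_cm / 100.0
    near = disparity_error(FOCAL_PX, b, 2.0, DEPTH_ERROR_M)   # foreground object
    far = disparity_error(FOCAL_PX, b, 6.0, DEPTH_ERROR_M)    # background region
    print(f"{baseline_cm:2d} cm: foreground shift {near:.2f} px, background shift {far:.2f} px")
```

Under these assumptions the shift grows linearly with the camera distance and decreases roughly with the square of the depth, which is consistent with depth accuracy mattering more for far views and less for distant background regions.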



5.2 $B_q^1(V)$ configuration

In this section we study the allocation problem for a camera configuration with
multiple virtual views. Camera 4 of the Ballet and Breakdancers datasets is used as
the reference camera and six virtual cameras separated by 1 cm are considered, three
on each side of the reference camera. On each side, the parameters of the virtual
cameras are set according to cameras 3 and 5, respectively.

Table 2 Performance penalty for fixed allocation in $B_1^1(V)$ - Comparison between
the proposed model and a priori allocation policies in terms of average and maximum
differences to the best achievable PSNR at total rates ranging from 0.1 to 0.5 bpp.
The column headers indicate the a priori allocation of $R_t$ relative to the total
rate.

$R_t$ percentage             20%    30%    40%    50%    60%    70%    80%    our model
Ballet        Average (dB)   1.43   0.66   0.29   0.11   0.07   0.14   0.35   0.05
              Maximum (dB)   2.20   1.15   0.63   0.31   0.21   0.39   0.77   0.11
Breakdancers  Average (dB)   1.97   1.14   0.68   0.40   0.21   0.10   0.06   0.06
              Maximum (dB)   3.23   2.08   1.33   0.77   0.44   0.16   0.11   0.13
Overall       Average (dB)   1.70   0.90   0.49   0.26   0.14   0.12   0.21   0.06
              Maximum (dB)   2.72   1.62   0.98   0.54   0.33   0.28   0.49   0.12

Fig. 5 Rate allocation results of $B_1^1(V)$ using the model-based and the optimal
allocation in terms of $R_t$ percentage at a total rate of 0.24 bpp; Ballet (left)
and Breakdancers (right). The virtual view is projected at 1 to 20 centimeters from
the reference view.



The optimal allocation is obtained similarly to Section 5.1. The optimal RD surface
is generated offline, for $R_t$ and $R_d$ rates between 0.02 and 0.5 bpp with
0.02 bpp steps. Then, at each bit rate $R$, the best allocation is calculated using
interpolation over this RD surface. The model-based allocation is the result of
solving Eq. (13) for $B_q^1(V)$. The reported distortion is the average distortion
over all six virtual views and the reference view, and also over the three
representative frames in each set, i.e., frames 0, 49 and 99.
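One possible way to turn such an averaged distortion into a single PSNR value is sketched below. It assumes 8-bit images (peak value 255) and assumes that the MSE is averaged before the conversion to dB, which the text does not state explicitly; the MSE values are made-up placeholders, not measurements.

```python
import math


def average_psnr(mse_per_frame_and_view, peak=255.0):
    """Average the MSE over all frames and views, then convert to PSNR in dB."""
    values = [m for frame in mse_per_frame_and_view for m in frame]
    mean_mse = sum(values) / len(values)
    return 10.0 * math.log10(peak ** 2 / mean_mse)


# Hypothetical MSE values: 3 frames, each with 6 virtual views plus the reference view.
mse = [
    [40.0, 42.0, 45.0, 30.0, 44.0, 43.0, 41.0],
    [38.0, 41.0, 44.0, 29.0, 43.0, 42.0, 39.0],
    [39.0, 40.0, 46.0, 31.0, 45.0, 41.0, 40.0],
]
print(f"average PSNR over {sum(len(f) for f in mse)} images: {average_psnr(mse):.2f} dB")
```

Averaging the MSE first gives more weight to the worst views than averaging the per-view PSNR values would.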



Figure 6 shows the performance in terms of PSNR with respect to the target bit rate
$R$, where $R$ varies between 0.1 and 0.5 bpp. The two curves correspond to the best
allocation and the model-based allocation. The loss due to using our model is 0.05 dB
and 0.03 dB on average for Ballet and Breakdancers, respectively, and the maximum
difference is 0.22 dB and 0.21 dB, respectively. Figure 7 shows the best and
model-based allocations in terms of the percentage of the total rate allocated to
$R_t$, for different values of $R$. Our model again performs very close to the
optimal allocation. This leads to clear improvements over a priori rate allocation,
as given in Table 3, as in the case of $B_1^1(V)$.



5.3 $B_q^p(V)$ configuration

We now consider the most general configuration, $B_q^p(V)$, with two reference
cameras ($p = 2$) and three equally spaced virtual views between them ($q = 3$).
Cameras 4 and 5 are considered as the two reference views, and $A_j$ and $R_j$,
$j = 1, 2, 3$, for the virtual views are set as the average of the intrinsic and
rotation matrices of our reference cameras. Each virtual view $V_j$ is generated in
two steps. If $\pi$ is the position of $V_j$, then each of the reference views is
first projected onto $\pi$ using the depth map information. This step produces
$v_{j,r}$ and $v_{j,l}$ as projection results from the right and left cameras,
respectively. Next, we have

$$V_j = \frac{d_{j,l}}{d}\, v_{j,r} + \frac{d_{j,r}}{d}\, v_{j,l}, \qquad (15)$$

where $d$ is the distance between the two reference cameras, while $d_{j,l}$ and
$d_{j,r}$ are the distances between $V_j$ and the left and right reference cameras,
respectively.
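The second step of this synthesis, the merging of the two projected views, can be sketched as a distance-weighted blend. The sketch assumes that the projection nearer to the virtual view receives the larger weight and that the virtual camera lies on the baseline between the two reference cameras, so that the two distances add up to d; images are represented as plain lists of pixel rows and occlusions are ignored.

```python
def blend(v_left, v_right, d_left, d_right):
    """Blend two projected views with weights inversely related to camera distance."""
    d = d_left + d_right                         # distance between the reference cameras
    w_left, w_right = d_right / d, d_left / d    # the closer camera gets the larger weight
    return [[w_left * a + w_right * b for a, b in zip(row_l, row_r)]
            for row_l, row_r in zip(v_left, v_right)]


# Toy 1x2 "images": projection from the left and from the right reference camera.
left = [[100.0, 100.0]]
right = [[200.0, 200.0]]

print(blend(left, right, d_left=1.0, d_right=3.0))   # virtual view close to the left camera
print(blend(left, right, d_left=2.0, d_right=2.0))   # virtual view halfway between cameras
```

With three equally spaced virtual views between the references, the left distances are one quarter, one half and three quarters of d.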

Fig. 6 Comparison of coding performance for $B_q^1(V)$ using the model-based
allocation method and the best allocation in terms of PSNR at rates ranging from 0.1
to 0.5 bpp; Ballet (left) and Breakdancers (right). Horizontal axis: total bit rate,
R (bpp).

Fig. 7 Rate allocation results of $B_q^1(V)$ using the model-based and the optimal
allocation in terms of $R_t$ percentage at total rates ranging from 0.1 to 0.5 bpp;
Ballet (left) and Breakdancers (right). Horizontal axis: total bit rate, R (bpp).

Table 3 Performance penalty for fixed allocation in $B_q^1(V)$ - Comparison between
the proposed model and a priori allocation policies in terms of average and maximum
differences to the best achievable PSNR at total rates ranging from 0.1 to 0.5 bpp.
The column headers indicate the a priori allocation of $R_t$ relative to the total
rate.

$R_t$ percentage             20%    30%    40%    50%    60%    70%    80%    our model
Ballet        Average (dB)   0.54   0.11   0.12   0.32   0.67   1.21   1.91   0.05
              Maximum (dB)   1.26   0.47   0.35   0.70   1.23   1.90   3.13   0.22
Breakdancers  Average (dB)   2.03   1.15   0.66   0.36   0.16   0.05   0.03   0.03
              Maximum (dB)   3.57   2.28   1.38   0.78   0.37   0.12   0.08   0.21
Overall       Average (dB)   1.29   0.63   0.39   0.34   0.42   0.63   0.97   0.04
              Maximum (dB)   3.57   2.28   1.38   0.78   1.23   1.90   3.13   0.22

The allocation problem in this case consists of distributing the available bit
budget between two reference views and two depth maps. For comparison purposes, we
calculate a DR hypersurface for the best allocation, with $R_{t1}$, $R_{t2}$,
$R_{d1}$ and $R_{d2}$ ranging from 0.1 to 0.6 bpp with 0.05 bpp steps. Then, for each
target bit rate $R$, the best allocation is the minimum of the curve resulting from
cutting this hypersurface with the hyperplane $R_{t1} + R_{t2} + R_{d1} + R_{d2} = R$.
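The exhaustive search on this four-dimensional grid can be sketched as follows. With eleven levels per axis, the full hypersurface contains 11^4 = 14641 coded configurations, of which only those lying on the hyperplane are compared for a given target rate. The function `toy_distortion` is a hypothetical placeholder for the measured distortion, and the sketch only considers exact grid points, without interpolation.

```python
from itertools import product

STEP = 0.05
LEVELS = range(2, 13)    # 2 * 0.05 = 0.10 bpp ... 12 * 0.05 = 0.60 bpp (11 levels)


def toy_distortion(rt1, rt2, rd1, rd2):
    """Hypothetical stand-in for the measured distortion of one coded configuration."""
    terms = ((50.0, rt1), (50.0, rt2), (30.0, rd1), (30.0, rd2))
    return sum(w * 2.0 ** (-8.0 * r) for w, r in terms)


def best_on_hyperplane(total_bpp):
    """Enumerate grid points whose four rates sum to total_bpp and keep the minimum."""
    target = round(total_bpp / STEP)             # integer steps avoid float drift
    candidates = [k for k in product(LEVELS, repeat=3) if target - sum(k) in LEVELS]
    best = None
    for k1, k2, k3 in candidates:
        rates = tuple(STEP * k for k in (k1, k2, k3, target - k1 - k2 - k3))
        d = toy_distortion(*rates)
        if best is None or d < best[1]:
            best = (rates, d)
    return len(candidates), best


count, (rates, dist) = best_on_hyperplane(0.6)
print(f"{count} grid points on the hyperplane; best (R_t1, R_t2, R_d1, R_d2) = {rates}")
```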

Fig. 8 Comparison of coding performance for $B_q^p(V)$ using the model-based
allocation method and the best allocation in terms of PSNR at rates ranging from 0.22
to 0.6 bpp; Ballet (left) and Breakdancers (right).

Fig. 9 Rate allocation results of $B_q^p(V)$ using the model-based and the optimal
allocation in terms of $R_t$ percentage at total rates ranging from 0.22 to 0.6 bpp;
Ballet (left) and Breakdancers (right). Horizontal axis: total bit rate, R (bpp).

Table 4 Performance penalty for fixed allocation in $B_q^p(V)$ - Comparison between
the proposed model and a priori allocation policies in terms of average and maximum
differences to the best achievable PSNR at total rates ranging from 0.22 to 0.6 bpp.
The column headers indicate the a priori allocation of $R_t$ relative to the total
rate.

$R_t$ percentage             20%    30%    40%    50%    60%    70%    80%    our model
Ballet        Average (dB)   0.93   0.26   0.06   0.11   0.35   0.68   1.15   0.05
              Maximum (dB)   1.49   0.56   0.21   0.24   0.65   1.28   1.79   0.17
Breakdancers  Average (dB)   2.15   1.19   0.68   0.38   0.20   0.11   0.08   0.05
              Maximum (dB)   3.16   1.95   1.21   0.66   0.47   0.33   0.21   0.20
Overall       Average (dB)   1.54   0.73   0.37   0.25   0.28   0.40   0.62   0.05
              Maximum (dB)   3.16   1.95   1.21   0.66   0.65   1.28   1.79   0.19

Figure 8 compares the best allocation and the model-based allocation of Eq. (13) for
the Ballet and Breakdancers datasets and target bit rates ranging from 0.2 to
0.6 bpp. Our allocation model leads to a 0.05 dB loss on average in both cases and to
a maximum loss of 0.17 dB and 0.20 dB for Ballet and Breakdancers, respectively.
Figure 9 shows the best and estimated allocations in terms of the percentage of the
texture bits ($R_{t1} + R_{t2}$) relative to the total bit rate. The advantage of
using our model over the commonly used strategy of a priori rate allocation is shown
in Table 4. In the a priori allocation, the texture budget is split equally between
the two reference views and the depth budget is split equally between the two depth
maps. For instance, in $B_q^p(V)$, if the total bit rate is 0.4 bpp and the a priori
allocation is 40%, then $R_{t1} = R_{t2} = 0.08$ bpp and $R_{d1} = R_{d2} = 0.12$ bpp.
Our model outperforms the a priori allocation thanks to its adaptivity to the content
and setup. From Tables 2 to 4 we can conclude that the best performance of an a
priori allocation strategy depends on the number of reference and virtual views and
on the scene content, whereas our model-based allocation works well in all cases and
gives the opportunity to determine the number of virtual views later at the decoder
side.
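The a priori split used as a baseline in this configuration can be written in a few lines. The function name and its parameters are illustrative; the only assumption is that the texture and depth budgets are each divided equally among the reference views and add up to the total rate.

```python
def a_priori_split(total_bpp, texture_share, n_views=2):
    """Split a total rate into equal per-view texture and depth rates."""
    texture = total_bpp * texture_share
    depth = total_bpp - texture              # the remaining budget goes to depth
    return texture / n_views, depth / n_views


r_t, r_d = a_priori_split(0.4, 0.40)
print(f"R_t1 = R_t2 = {r_t:.2f} bpp, R_d1 = R_d2 = {r_d:.2f} bpp")
```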



6 Conclusion 

We have addressed the rate-distortion analysis of multiview coding in a
depth-image-based rendering context. In particular, we have shown that the distortion
in the reconstruction of camera and virtual views at the decoder is driven by the
coding artifacts in both the reference images and the depth information. We have
proposed a simple yet accurate model of the rate-distortion characteristics for
simple scenes and different camera configurations. We have used our novel model to
derive an effective allocation of bit rate between reference and depth images. One of
the interesting features of our algorithm, beyond its simplicity, is that it avoids
the need for view synthesis at the encoder, contrary to what is generally done in
state-of-the-art solutions. We have finally demonstrated in extensive experiments
that our simple model extends well to complex multiview scenes with arbitrary numbers
of reference and virtual views. It leads to an effective allocation of bit rate with
close-to-optimal quality under various rate constraints.






In particular, our rate allocation outperforms common strategies based on static
rate allocation, since it is adaptive to the scene content. Finally, we plan to
extend our analysis to multiview video encoding, where motion compensation poses
non-trivial challenges for rate allocation algorithms due to additional coding
dependencies.